ABSTRACT

This study was undertaken to assess the accuracy of
principals' judgments of the effectiveness of the teachers they
supervise. Each of 46 principals was asked to fill out a brief form
judging the overall effectiveness of each of the teachers in his or 
her school. The form asked how effective the teacher was in 
performing three roles: (1) promoting academic goals, (2) promoting 
affective goals, and (3) performing other professional functions. 
Each principal's judgments of teachers of a single grade were 
intercorrelated with expected achievement gains of pupils of high, 
average, and low ability in the teachers' classes. Analytical 
procedures similar to those used in "meta-analyses" were used to 
examine the resulting large set of correlations. Findings revealed 
that the relationship between principals' judgments of teacher 
effectiveness and pupils' gains on achievement tests is very low. The 
factor most closely related to the magnitude of the correlation 
between principals' judgments and pupils' gains was the grade taught
by the teachers rated. Other factors tested that were found not to be 
significantly related to the size of the correlations were pupil 
ability, subject taught, teacher role judged, and interactions 
between and among these factors. Tables and notes are 
included. (TE)

A STUDY OF THE CORRELATION BETWEEN PRINCIPALS'
RATINGS OF TEACHER EFFECTIVENESS AND PUPIL GROWTH*

ABSTRACT 

The study reported here was undertaken for the primary purpose
of assessing the accuracy of principals' judgments or opinions of
the effectiveness of teachers they supervise. By far the
principal basis for personnel decisions about teachers is a
rating of each teacher involved made by the teacher's principal
or his or her assistant. Since, because of the well-known "halo
effect," the principal's overall opinion of the teacher rated is
a major determinant of the rating that teacher receives, the
question whether these opinions are valid is an important
question to ask.

Relatively few attempts have been made in the past to validate
principals' judgments or ratings against measures of teacher
effectiveness based on achievement gains of pupils in their
classes; and those few attempts have consistently failed. The
clear implication is that neither the judgments nor the ratings
are accurate, and that many decisions based on them are wrong
decisions. It seems time someone designed and conducted a new 
study of the problem, one that would give the principals' 
judgments every possible chance to prove themselves valid if 
indeed they are. 

Design of the Study. The study that will be reported here 
differs from those that have gone before it in that instead of 
correlating judgments of teachers of different grades made by 
different principals with measures of teacher effectiveness, only 
judgments made of teachers of the same grade by the same
principal were used. It also differs in that the 46 principals
studied were asked to record their overall judgments rather than
recording judgments on several characteristics on a multi-factor
rating scale. Finally, the procedure used to estimate teacher
effectiveness was different from, and possibly more valid than,
those used in past studies.

Each principal in the study was asked to fill out a brief form
indicating the effectiveness of each of the teachers in his or her
school whose effectiveness he or she felt capable of judging.
The form asked how effective the teacher was in performing three
roles: (I) promoting academic goals, (II) promoting affective
goals, and (III) performing other professional functions.

Each principal's judgments of teachers of a single grade were
then intercorrelated with expected achievement gains of pupils of
high, average, and low ability in the teachers' classes. Since
the number of classes per grade tended to be small in most
schools, this meant that any one validity estimate was highly
unstable because it was based on a very small group of teachers.
(The average number of teachers per correlation in the 87 grade
groups was, in fact, only 3.7.) The number of correlations
estimated, on the other hand, was quite large. Twenty-four
correlations were calculated for each principal and grade group,
so that most of the mean correlations estimated were quite
stable. Analytical procedures similar to those used in
"meta-analyses" were used to examine this large set of
correlations.

Findings. The mean correlation between a principal's judgment
of the role I effectiveness of teachers (of the same grade and
subject) and measured teacher effectiveness with pupils of
average ability was only .20, and differences in the mean
correlations for different principals were not statistically
significant. There was, therefore, no reason to disagree with
the conclusions of previous studies: that the relationship
between principals' judgments of teacher effectiveness and how
much their pupils gain on achievement tests is very low.

The factor most closely related to the magnitude of the
correlation between the principals' judgments and pupil gains was
the grade taught by the teachers rated. Other factors tested
which were not found to be significantly related to the size of
the correlations were pupil ability, subject taught, teacher role
judged, and interactions between and among these factors.

Principals' Ratings. Because a substantial number of the
schools in the study were located in Georgia, a unique
opportunity arose to study principals' ratings. As part of the
process of teacher certification, all beginning teachers in these
schools were observed and rated by their principals (and two
other raters) on the TPAI (Teacher Performance Assessment
Instruments), a particularly well constructed rating scale. If
and when the state department of education makes these ratings
available, it will be possible to study the relationship between
such ratings and principals' overall impressions of the
effectiveness of the teachers rated, as well as to assess the
validity of the ratings directly.



INTRODUCTION AND OVERVIEW 



It is difficult to overstate the importance to public education
of economical, accurate and practicable procedures for evaluating
teachers. Effective operation of the educational enterprise (or
any other) requires that all personnel be used efficiently. This
in turn requires accurate and timely personnel decisions which
depend on administrators' ability to distinguish more effective
teachers from less effective ones quickly, economically, and
(above all) accurately. As we shall see, what evidence there is
indicates that such distinctions are not possible with the
methods of teacher evaluation in current use.

The vast majority of personnel decisions made in education
(like those in such other fields as business, industry and the
military) are based on subjective judgments of employee
competence made by immediate supervisors and recorded in the form
of ratings. The validity of such ratings and the accuracy of
decisions based on them depend very much on how good a judge of
competence the rater happens to be.

The use of ratings can be defended only if we are willing to 
assume that the principal or other person who supervises teachers 
is an expert judge of teacher effectiveness, that most or all of 
his or her judgments are valid. That this is true is taken for 
granted; how expert any particular principal is, or principals in 
general are, is a question no one ever seems to ask. 

A few studies which did ask this question were done some years
ago. All of them reached the same conclusion: that the validity
of a rating made by the average principal is near zero. The
implication is clear: that teacher personnel decisions based on
pure chance would be just about as accurate as decisions based on
principals' ratings are.

Since the most recent of these studies was done more than a 
quarter of a century ago, using methodology then available, now 
seems to be a good time to reopen the question, to do a new 
study. This report will describe such a study, a project in
which we collected new data and applied a modern statistical
design, one free from certain methodological limitations of the
earlier studies.

Statement of the Problem. The main question this study was
designed to answer is: How valid are principals' judgments of
teacher effectiveness? Three related questions also investigated
are: Are some principals' judgments more valid than others?
What are some of the factors which affect the validity of
principals' judgments? and How much effect do principals'
overall judgments have on their ratings of teachers on
multi-factor rating scales?

Justification. Being able to distinguish more effective
teachers from less effective ones is the key to bringing about
those improvements in the education of children that depend on
the quality of the teaching in the schools. The current concern
with the competence of teachers and the demand for higher
standards, merit pay plans and the like is new only in being
noisier than a continuing concern on the part of the public and
the professions as well. Its solution depends almost entirely on
being able to evaluate teachers accurately, an ability whose lack
neither the public nor most educators seem to suspect. The
complete failure of past attempts to establish the validity of
the ratings universally employed to accomplish this task makes it
imperative to discontinue their use unless or until evidence of
their validity is obtained.

It is just possible that the failures of previous attempts to
validate principals' judgments were due in whole or in part to
defects in the designs of the studies, that the judgments were
valid but their validity was not detected. In any new study,
therefore, it seemed important to take particular care to correct
these defects and to give the principals' judgments every
possible chance to prove themselves valid (if indeed they are).
The study therefore involves some methodological innovations.

Sample. The sample of principals and teachers used in the study 
was drawn from elementary schools in the southeastern United 
States, a substantial number of which were located in Georgia.
The sample used contained 46 principals and 322 teachers. 

Methodological Innovations. The traditional approach to the 
problem of validating principals' ratings has been to correlate
ratings of teachers of various grades and in various schools made
by their principals on one hand with measures of teacher 
effectiveness based on test scores of the pupils they teach on 
the other. The same basic approach was used in this study, but 
it was modified in two important respects. Each estimated 
correlation was based on a sample of teachers of the same grade 
in the same school. Because of this, no principal was asked to 
compare teachers of different grades, and validities of ratings 
made by different principals were estimated separately. Finally, 
the estimates of the effectiveness of all teachers of the same 
grade were based on gains of pupils of the same level of ability 
instead of on the average gain of all pupils in a teacher's 
class. 
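The within-group design described above can be sketched in a few lines of code. This is only an illustration under stated assumptions, not the study's actual procedure: the record layout, the function names, the minimum group size of three, and the use of a plain unweighted mean of the correlations are all hypothetical choices made for the example.

```python
import math
from collections import defaultdict


def pearson_r(xs, ys):
    """Pearson correlation; returns None when it is undefined or trivial."""
    n = len(xs)
    if n < 3:  # assumed cutoff: with two teachers r can only be +1 or -1
        return None
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:  # no variation in judgments or in gains
        return None
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


def mean_within_group_r(records):
    """records: iterable of (principal, grade, judgment, pupil_gain).

    One correlation is computed per principal-and-grade group, so no
    principal is compared with another and no grade with another.
    The group correlations are then averaged (unweighted, an assumption).
    """
    groups = defaultdict(list)
    for principal, grade, judgment, gain in records:
        groups[(principal, grade)].append((judgment, gain))
    rs = []
    for pairs in groups.values():
        r = pearson_r([j for j, _ in pairs], [g for _, g in pairs])
        if r is not None:
            rs.append(r)
    return sum(rs) / len(rs) if rs else None
```

Because each group holds only a handful of teachers, any single correlation is unstable; only the mean over many groups is meant to be interpreted.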

As part of the process of being certified competent to teach in
that state, all beginning teachers in Georgia schools are rated
by their principals (and two other raters) on the TPAI (Teacher
Performance Assessment Instruments), a multi-factored
behaviorally anchored rating scale. The willingness of the
Georgia department of education to release these ratings to us
makes it possible to examine the relationship of ratings made
with one of the most carefully constructed behaviorally anchored
rating scales in existence to the raters' overall judgments of
the effectiveness of the teachers being rated, as well as to
gains of pupils in their classes.



REVIEW AND CRITIQUE OF RELATED LITERATURE 



There would be no sense in repeating a study that had already 
been repeated several times with consistent findings unless there
were some reason to expect a different result this time. In the
following pages we propose to demonstrate that there is such a
reason by briefly reviewing and discussing past research in the 
validity of principals' evaluations of teachers. In particular, 
we will point out some methodological problems with these 
studies, especially in the procedures used to derive measures of 
teacher effectiveness from measurements of pupils' gains on 
achievement tests, problems which will be avoided in this study. 
We will discuss studies of the validity of the TPAI separately, 
for reasons that will become apparent later. 



Studies of the Validity of Principals' Ratings 

The focus of interest here is not so much on the validity of 
principals' ratings of teachers as such as on the validity of the 
overall opinions principals form of the effectiveness of teachers 
being rated. It is our contention that the principal's overall 
impression of a teacher's effectiveness (often called "halo") is
the principal determinant of his ratings of that teacher. 

The Halo Effect. The multi-factor teacher rating scale seems to
have become popular with educators around the year 1915 [1].
Instead of recording his overall judgment of the effectiveness of
the teacher being evaluated, the principal (or other person)
using such a rating scale records separate judgments of the
status or level of the teacher being rated on a number of
different characteristics, each of which is supposed to be
related to teacher effectiveness. These separate ratings are
then summed (or combined in some other way) to yield an overall
indicator of the effectiveness of the teacher being rated.

The teacher rating scale was embraced enthusiastically and
promptly by educators [2], and is still used almost everywhere to
evaluate teacher competence, teacher performance, and teacher
effectiveness as well.

The influence of the rater's general impression of the
competence or effectiveness of the person being rated was
recognized very early [3], and came to be known as the "halo
effect." [4] The high intercorrelations typically found among
ratings of the same teacher on widely disparate characteristics
give evidence of the strength of this effect. The validity of
the total or composite scores teachers get on a multi-factor
teacher rating scale may, and probably does, depend more on the
validity of principals' overall judgments of the teachers than on
the degree to which they possess any of the characteristics or
factors listed on the instrument.



Since researchers in the past have usually asked principals to
record their judgments on multi-factor rating scales rather than
as global judgments, research on the validity of principals'
ratings provides the best information available about the
validity of principals' judgments of teacher effectiveness.

Nine Studies. A search of the literature has turned up no more
than nine published studies in which principals' ratings of
teachers have been correlated with measures of gains in test
scores of pupils in their classes. [5]

None of these studies was originally designed to test the
validity of principals' ratings. The validity of the ratings
seems to have been taken for granted by the researchers, who
looked upon principals' ratings and measures of pupil gains as
alternative "criteria of teacher effectiveness" with which
measures of various other teacher characteristics could be
correlated to find out whether they were related to teacher
effectiveness. Before doing so the authors of each of the nine
studies chose to intercorrelate these alternative criteria with
each other.



All nine studies reached the same conclusion: that the 
correlation between principals' ratings and measures of teacher
effects on pupils is close to zero. In other words, the average 
validity of principals' ratings is close to zero. Figure 1 
quotes the conclusions stated by the author of each study 
verbatim. Such unanimity is rare in educational research. 

Barr's Conjecture. In discussing their findings, Barr and his 
colleagues suggest that the validity of a principal's ratings may 
depend on who the principal is; that, even though the average 
validity in the population of principals is low, there may be 
some principals who are better judges of teachers than most, and
whose ratings are valid. If this were so, it would be important
to identify these principals, to find out how they differed from
the others, and to train these other principals to imitate them.
This is one of the questions the present study attempted to
answer.

CONCLUSIONS REACHED IN NINE STUDIES THAT ATTEMPTED
TO RELATE PRINCIPALS' RATINGS OF TEACHERS
TO MEASURED GAINS OF PUPILS IN THEIR CLASSES



1. Anderson, 1954: "... no appreciable relationships exist
between rating criteria and pupil attainment criteria."
(p. 67.)

2. Barr et al., 1935: "The observed coefficients of
correlation between the measures of teaching ability and
the three measures of gain in pupil achievement are
uniformly low." (pp. 107-103.)

3. Brookover, W.B.: "Employers' ratings of teaching ability are
not related to pupil gains in information." (p. 205).

4. Gotham, R.E.: "... the criterion of pupil change apparently
measures something different from that measured by teacher
ratings." (p. 165).

5. Hellfritsch, A.G.: "Teacher rating scales ... are only
slightly related to the observed pupil growth." (p. 199).

6. Jayne, C.D.: "... supervisory ratings ... seem to lack
reliability and validity [as measures of pupil gain]."
(p. 133).

7. Jones, R.D.: "Whatever pupil gain measures in relation to
teaching ability it is not that emphasised in supervisory
ratings." (p. 98).

8. LaDuke, C.V.: "... supervision ratings here provided are
invalid [as predictors of pupil gain]." (p. 97).

9. Lins, L.J.: "The three criteria ... [pupil gain, pupil
evaluations of the teacher, and a composite of five
supervisory ratings] are not related to a greater degree
than can be attributed to chance." (p. 59).

10. Medley and Mitzel, 1959: "The results of the present study
... suggest that supervisory ratings do not correlate with
[pupil] growth..." (p. 244).



FIGURE 1

Procedures for Estimating Teacher Effectiveness

It seems clear from the foregoing discussion that the
correlation between principals' judgments of teacher
effectiveness and measures of teacher effectiveness based on
measured achievement gains of pupils tends to be very low. It is
natural to attribute this low relationship to difficulties
principals have in distinguishing more effective teachers from
less effective ones; that is, to say that the judgments are not
valid. But it is certainly possible that the low correlations
may be due, at least in part, to defects in the measures of
teacher effectiveness, that they lack validity. Let us consider
this possibility.

Validity of Direct Measures of Teacher Effectiveness. The
validity of a direct measure of teacher effectiveness, that is,
one based on pupil gains on achievement tests, depends on two
things: it depends first of all on the validity of the test or
tests used as measures of achievement of the objectives the
teacher is or ought to be working toward; and, second, on the
degree to which it succeeds in isolating that part of the gains
pupils make that results from the efforts of the teacher from
that which would have taken place anyhow.

Let us begin by assuming that the tests administered to the
pupils are valid measures of objectives the teacher is expected
to achieve. This assumption has been questioned by some on the
grounds that the content of the items on the test may not
coincide exactly with the items of content the teacher actually
teaches. Our reasons for rejecting this notion will be given
later. The assumption seems reasonable enough when, as is the
case in this study, the test used is one adopted by the local
school system as an appropriate measure of system-wide goals.

Isolating the Teacher's Contribution. Meeting the second 
condition is more difficult. If it were possible to assign 
pupils to classes randomly, so that at the beginning of the 
school year the classes taught by different teachers would differ 
only by chance, there would be no problem. Any differences in
post-test scores of pupils in different teachers' classes beyond 
those attributable to chance could safely be attributed to 
differences in the effectiveness of the teachers of those 
classes. But when pupils are not randomly assigned to classes, 
the classes differ at the beginning of the year in unknown ways 
and to an unknown degree. It is therefore necessary to 
distinguish among the differences found at the end of the year 
those that merely reflect differences that existed at the
beginning of the year from those that did not, and somehow 
measure the latter in isolation from the former. 

In past studies of teacher effectiveness, the basic approach to
this problem has been to estimate the mean achievement gain of
all of the pupils in each teacher's class, and then to
compensate for differences between the pupils in different
classes by statistical adjustments.

We will introduce and use a different approach entirely. But
before doing so, let us briefly review and comment on the most
common procedures used in the past. All of them begin by
regressing posttest scores on pretest scores and predicting the
mean posttest score in each teacher's class with the regression
equation. The difference between the mean of the posttest scores
the pupils in a teacher's class actually earn and the mean of
their predicted posttest scores is used as a measure of that
teacher's effectiveness.

Residual Gains. The main difference among the three techniques
that have been used is in how the regression line is estimated.
In the earliest method, called the residual gains method, the
regression was estimated from the variance and covariance between
classes; that is, by intercorrelating class mean pretest scores
with class mean posttest scores. Mitzel and Gross, in their
classic paper on the topic [6], reject this procedure on the
grounds that it adjusts out some of the differences between
classes that it is supposed to estimate.

Adjusted Mean Gains. Mitzel and Gross recommended, instead, the
use of adjusted mean gains, that is, that the regression be
estimated from pooled within-class variance and covariance.
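The difference between a between-class slope and a pooled within-class slope may be easier to see in code. The sketch below is illustrative only; the function names and data layout are assumptions, and the between-class version uses unweighted class means.

```python
def between_class_slope(classes):
    """Slope from regressing class mean posttest on class mean pretest."""
    means = [(sum(pre for pre, _ in pupils) / len(pupils),
              sum(post for _, post in pupils) / len(pupils))
             for pupils in classes.values()]
    mx = sum(x for x, _ in means) / len(means)
    my = sum(y for _, y in means) / len(means)
    sxx = sum((x - mx) ** 2 for x, _ in means)
    sxy = sum((x - mx) * (y - my) for x, y in means)
    return sxy / sxx


def pooled_within_class_slope(classes):
    """Slope from sums of squares and products taken about each class's own means."""
    sxx = sxy = 0.0
    for pupils in classes.values():
        mx = sum(pre for pre, _ in pupils) / len(pupils)
        my = sum(post for _, post in pupils) / len(pupils)
        sxx += sum((pre - mx) ** 2 for pre, _ in pupils)
        sxy += sum((pre - mx) * (post - my) for pre, post in pupils)
    return sxy / sxx
```

When classes that start higher also gain more, the between-class slope is steeper than the within-class slope, so adjusting with it removes part of the very class-to-class difference that the effectiveness measure is meant to capture.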

Multiple Regression. More recently, some investigators have 
used total variance and covariance in a multiple regression in 
which pretest scores are entered first, then the variables with 
which teacher effectiveness is to be correlated. 

Unfulfilled Assumptions. Use of any of these techniques is
based on two assumptions that are rarely if ever fulfilled in
practice. One is that the pupils have been randomly assigned to
the classes of the different teachers; the other is that the
regression slopes within classes are equal. As we have already
noted, random assignment of pupils rarely happens. It is, of
course, impossible unless the sample of teachers studied consists
of teachers of the same grade and subject in the same school,
because pupils cannot be assigned to grades, subjects, or schools
at random.

The assumption that regression slopes (and therefore
pretest-posttest correlations) within classes are equal is
testable, and when it is tested is usually found to be false.
The correlation between pretest and posttest scores within a
teacher's class is, in fact, a characteristic of the teacher that
is important in its own right, since it reflects the degree to
which the effectiveness of the teacher varies with pupil
ability. A positive slope indicates a class in which high
ability pupils gain more rapidly than low ability pupils do; a
zero slope indicates a class in which all pupils gain at the same
rate, etc.

Fitting a single regression line to the pupils in different
classes not only results in a poor fit, then, but it also
conceals important information about teacher effectiveness.

Regression Artifact. The most defensible of these three
procedures is, of course, the analysis of covariance, which does
not confound between-class and within-class covariation. This
procedure has also been widely used in quasi-experimental or ex
post facto studies, ones in which subjects are not randomly
assigned to treatments, to achieve the same purpose, that is, to
compensate for pre-existing differences between groups.

It has been shown, however, that because of an artifact of
regression [7], when this procedure is used with groups that
differ initially it has the opposite effect to the one intended.
That is, it increases the bias it is supposed to reduce.

What is important to us is that, since all nine of the studies
cited earlier used procedures of this type, it is possible that a
bias in the estimates of teacher effectiveness may have concealed
the validity of principals' ratings in all of them. To
avoid this possibility, in the present study we will use a
procedure different from any of those described, one which avoids
both of the untenable assumptions implied in the use of the
procedures described above.



Validity of the TPAI 

Among many attempts to control or eliminate the halo effect,
one of the most promising has been the use of "behavior anchors"
on the separate scales of a multi-factor rating scale. A behavior
anchor consists of one or more specific examples of behaviors
typical of teachers at a specific level on the dimension the
scale is intended to measure. Their inclusion is intended to
increase the accuracy of ratings on a subscale by clarifying and
simplifying the task of the rater. [6]

Of special interest in this investigation is the carefully
constructed behaviorally anchored rating scale (or set of scales)
called the Teacher Performance Assessment Instruments (TPAI). The
TPAI was developed, and for several years has been used, for
certifying beginning teachers in the state of Georgia. [9]

Studies of Validity of the TPAI. A series of studies of the
predictive validity of the TPAI has been reported at various
research meetings; we propose to review these studies here. [10]
Because they mainly report correlations between pupil gains and
scores on individual TPAI items or competencies instead of total
scores, results of these studies do not shed as much direct light
on the question addressed by the present investigation as we
might wish. They do not tell us as much about the validity of
principals' judgments of the effectiveness of teachers as the
nine studies already discussed. But they do bear directly on the
questions about the accuracy of decisions about educational
personnel with which this study is concerned.

Measure of Teacher Effectiveness. Three kinds of tests have
been used in these studies to measure teacher effectiveness:
standardized tests, teacher-made tests, and criterion-referenced
tests. By and large the correlations reported are correlations
between measures of teacher effectiveness and scores on single
TPAI items or competencies rather than total scores. The results
obtained seem to depend on the kind of test used. When
standardized tests were used, the correlations obtained are
described by the authors as "mixed." When criterion-referenced
tests were used, at least some of the correlations reported tend
to be significant. And when teacher-made tests were used, many
more correlations are significant.



Test Content and the Nature of Effective Teaching. These
authors raise a familiar objection to the use of standardized
test scores of pupils to estimate teacher effectiveness, the
objection that because a standardized test may not measure the
exact content taught by the teacher, it is not a valid basis for
assessing teacher effectiveness. This fallacy reflects a basic
misunderstanding of the proper function of standardized tests, of
the nature of effective teaching and, indeed, of the purpose of
education.

It is the function of a standardized test to measure the 
important, permanent changes in pupils that teacher-made unit 
tests cannot measure. Growth in the ability to read critically, 
to apply the scientific method, to learn on one's own, and the 
like, is gradual, difficult to measure, and in most cases can be 
detected only over relatively long periods of time. These are 
the kinds of things teachers are hired to teach. These are the
kinds of outcomes that distinguish truly effective teachers
from the rest. These are the kinds of outcomes on which measures
of teacher effectiveness should be based.

Standardized tests are not, or should not be, designed to
measure pupils' mastery of the specific content of the day-to-day
lessons or units taught in the schools. This is what the unit
test, which is usually built by the teacher, is supposed to
measure. Most of it will be forgotten by the pupils promptly
once they have passed the unit test.

The specific content that a teacher teaches in a lesson or unit
is a means to the ends the teacher is supposed to achieve, but
not the end itself. Content objectives are no more than
"enabling" objectives; the actual content taught in a unit is not
important and will soon be forgotten by most pupils, and rightly
so. But in the process of learning (and forgetting) this content
the teachers' pupils ought to learn something else which they
will not forget, something only one of the better standardized
tests can measure.

The unit tests that a teacher constructs to measure how much of
the content of the unit pupils have learned are useful for such
purposes as guiding and motivating pupils to learn the content,
and providing a practical basis for giving them grades. How well
a pupil learns the content is a pretty good indicator of how much
progress the pupil is making toward the important goals of
education.

So far as we know, none of the criterion-referenced tests so
much in vogue these days are designed to measure anything more
than the specific content teachers are supposed to teach. It is
important that the content of a criterion-referenced test matches
that taught by a teacher. But pupil gains on such tests do not
validly indicate how effective a teacher is in performing the
basic function of a teacher, which is to educate children, to
change them permanently and in important ways.

Only a standardized test, and a good one at that, is capable of 
measuring how successful a teacher is in educating pupils, and it 
can only do so by measuring changes over a substantial period of 
time, preferably a full school year. And even the best 
standardized test cannot do this when the teachers "teach to the 
test," that is, when they teach the specific content of the 
test. When that happens, the validity of the test as a measure 
of the important outcomes of education is destroyed; and it 
becomes, in effect, nothing more than another 
criterion-referenced test. 

This is one concern we have with the TPAI validity studies: 
that they fail whenever standardized tests are used to assess 
teacher effectiveness, and succeed when tests that measure only 
the pupils' immediate mastery of content are used. But we have a 
more serious problem than that. 

The Comparability Problem. Unless the same standardized test is 
administered to all classes in a study, scores from different 
classes are not comparable unless something is done to make 
scores on different tests equivalent. The authors' solution to 
this problem was to use a statistic called the Index of 
Achievement Gain, which seems to be home-grown. A pupil's Index 
of Achievement Gain is calculated by dividing the increase in the 
number of items the pupil answers correctly from the pretest to 
the posttest (the actual gain) by the number of items the pupil 
failed to answer correctly on the pretest (the possible gain). 
The mean of these indices for all pupils in a teacher's class was 
the teacher effectiveness measure used with teacher-made tests 
and criterion-referenced tests. [11] 
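
The calculation of the Index of Achievement Gain can be sketched in a few lines of Python. This is only an illustration of the definition given above; the function names and the three-pupil class are hypothetical and do not come from the studies under review.

```python
def index_of_achievement_gain(pretest_correct, posttest_correct, n_items):
    """Actual gain divided by possible gain for a single pupil."""
    possible_gain = n_items - pretest_correct  # items missed on the pretest
    if possible_gain <= 0:
        raise ValueError("pupil had no room to gain on this test")
    actual_gain = posttest_correct - pretest_correct
    return actual_gain / possible_gain


def class_mean_index(scores, n_items):
    """Mean index over (pretest, posttest) pairs for one teacher's class."""
    indices = [index_of_achievement_gain(pre, post, n_items)
               for pre, post in scores]
    return sum(indices) / len(indices)


# Hypothetical class of three pupils on a 20-item unit test.
example_class = [(10, 15), (5, 8), (16, 18)]
print(class_mean_index(example_class, n_items=20))
```

Note that nothing in the calculation refers to the difficulty or content of the test, which is why the index cannot by itself make scores from different tests comparable.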

There is no reason to suppose that this statistic yields 
comparable scores from non-comparable tests. Suppose, for 
example, that Miss Jones' slow-learning fifth grade pupils gain 
10% on her 23-item unit test on improper fractions; and that Miss 
Smith's above-average pupils gain 15% on her unit test on the 
Civil War. On what basis can we conclude, as these investigators 
do, that Miss Smith is a more effective teacher than Miss Jones? 

It is puzzling and disturbing to note that it is only when 
these investigators use this highly questionable statistic that 
they get significant correlations with TPAI scores. Whatever it 
is that indices of achievement gain based on non-equivalent tests 
measure, it is not the relative effectiveness of the teachers who 
built the tests. 

It is more likely that these indices tell us something about 
the teachers' skill in constructing tests; but why should that 
correlate with scores on TPAI items? Can it be that whatever 
makes some teachers impress observers most favorably also makes 
them write test items on which their pupils make large percentage 
gains? Far-fetched as this explanation may be, it is more 
credible than the idea that these indices yield comparable 
measures of teacher effectiveness. 

Perhaps the best conclusion we can reach about the validity of 
the TPAI as a measure of teacher effectiveness is that the 
question is still open. The fact that TPAI scores are used as at 
least a partial basis for deciding whether candidates will or 
will not be granted teaching certificates makes it worth while to 
try once more to validate it. 



Summary and Conclusions 

The facts that emerge from this brief look at the literature 
clearly call into question the wisdom of the almost complete 
dependence of personnel decisions in education on principals' 
ratings. The fact is that all attempts to establish the validity 
of such ratings against criteria of teacher effectiveness based 
on measured achievement gains of pupils have been unsuccessful. 
The validity of the methods used in these studies to estimate 
teacher effectiveness is, however, open to question. Until the 
possibility that methodological shortcomings may account for 
these findings can be ruled out, there is a need for 
studies which are free from these methodological flaws. 
PROCEDURES 



In this section of this report we will describe the selection 
of the sample of principals studied, the collection of the data, 
the instrumentation, the measure of teacher effectiveness, and 
the analytical methodology of the study. 



The Sample 

The sample of principals, teachers, and pupils used was 
obtained by seeking the cooperation of school districts in the 
southeastern United States. If a school district agreed to take 
part in the study, the next step was to find out whether the 
regular testing program in the district yielded the data needed 
in the study. If it did, each elementary-school principal in the 
district was asked to record judgments of the effectiveness of as 
many of the teachers in his or her school as possible. Usable 
data were obtained from 46 principals on 322 teachers. 



Data Collection 

Each principal in the sample recorded his or her judgment of 
the effectiveness of each teacher he or she supervised on a 
simple form. A roster of each class was obtained that showed the 
fall and spring scores of each pupil in that class on whatever 
test battery was used in the regular testing program in the 
district. The state education department of the state of Georgia 
kindly consented to provide us with any ratings it had obtained 
of those of our 322 teachers who were first-year teachers in the 
state of Georgia (although it has not yet done so). 



Instrumentation 

Three instruments were used in the study: the form on which 
principals recorded their judgments of teachers, the achievement 
tests administered to the pupils in the 322 classes, and the 
rating scale used in the Georgia certification program. 

Principals' Judgments. The instrument on which the principals 
were asked to record their judgments was a simple form used in a 
study reported in 1959. [1] (See Figure 2.) On it the principal 
indicates where the teacher would stand in comparison with a 
typical group of 20 teachers of the same grade on three "roles" a 
teacher is expected to perform, defined in Figure 2. 
INSTRUCTIONS TO PRINCIPALS 

Teachers in today's schools must perform competently in 
at least three roles in order to be successful. You are being 
asked to share with us your best judgment as to how well the 
teacher named above fulfills each of them in your school as a 
teacher of the subject named. 

Please indicate your judgment by writing a number between 
one and twenty in the space before the description of each role 
printed below. The number should indicate where you think the 
teacher would rank in a representative group of teachers in that 
subject and grade. If the teacher performs better than all the 
rest, write 20; if all the others perform better than this 
teacher, write 1; and so on. 

All ratings will be kept confidential; no one except the 
clerk who transcribes the data (and removes all names) will know 
the name of either the teacher or the principal involved. These 
sheets will be destroyed as soon as the data have been 
transcribed. 

___ ROLE I    The teacher is responsible for providing learning 
              experiences which result in pupils' acquisition of 
              fundamental knowledge. 

___ ROLE II   The teacher is responsible for providing children 
              with learning experiences which lead to good 
              citizenship, personal satisfaction, and self 
              understanding. 

___ ROLE III  The teacher is a professional colleague of other 
              teachers, supervisors, and administrators. 

FORM ON WHICH PRINCIPALS RECORDED 
THEIR JUDGMENTS OF TEACHER EFFECTIVENESS 

FIGURE 2 
Achievement Tests. In each school, the reading and arithmetic 
subtests of the battery used in the regular testing program in 
the school were used to measure achievement gains of pupils. 

In the conventional study of teacher effectiveness, in which 
teacher ratings and teacher effectiveness measures are 
intercorrelated across schools, it is necessary to use the same 
tests in all classes so that the teacher effectiveness measures 
are comparable. But since all correlations in this study were 
calculated in groups of teachers of the same grade in the same 
school, it was not necessary to use the same test in every 
school. Instead, the test used in each school was the one chosen 
by that school as most appropriate. When we asked a principal 
how effective a teacher was, we meant how effective in terms of a 
test already in use in that school with which both the principal 
and the teacher were already familiar, and one which presumably 
measured the goals of the school. 

Rating Scale. The rating scale used in the second phase of the 
study was the TPAI (Teacher Performance Assessment Instruments), 
which was developed and is used in Georgia as one of a number of 
instruments used as a basis for certifying teachers in the 
state. It was chosen mainly for the reason already given: that 
ratings made of beginning teachers were on file and available. 
It would have been an excellent choice in any case since it is 
one of the most carefully constructed and widely used 
behaviorally anchored multi-factor rating scales in existence. 



Expected Gain Scores 

The measure of the effectiveness of each teacher that was used 
in this study was the Expected Gain Score of a pupil with a 
specified level of ability as indicated by his or her pretest 
score on the test used to measure achievement gains. Since this 
measure has never to our knowledge been used before for this 
purpose, we propose to describe it here in some detail. 

A pupil's Expected Gain Score or EGS is an estimate of the 
score he or she will earn at the end of the school year; it 
depends, among other things, on the pupil's ability and on which 
teacher's class he or she is in. Differences between scores the 
same pupil would be expected to get in different teachers' 
classes will be used as measures of differences in teacher 
effectiveness. 

How is the score the pupil will get at the end of a school year 
in a teacher's class (his or her EGS) estimated? By entering the 
pupil's pretest score into a simple linear regression equation 
based on the correlation between the pretest and posttest scores 
of all of the pupils in that teacher's class. Such a regression 
equation looks like this: 

          y = a + bx 

The values of a and b, known as the regression coefficients, 
depend on which class the pupil is in. The value of x, the 
pretest score, can be anything you choose within the range of 
scores on the test. From these three numbers the value of y, the 
EGS, can be calculated. Since the values of the regression 
coefficients a and b will differ from one teacher's class to 
another, the EGS obtained with any given pretest score will 
differ for different teachers. In other words, pupils with 
identical pretest scores will get different EGS's, will learn 
different amounts, in different teachers' classes. 
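
The procedure just described can be illustrated with a minimal Python sketch. It assumes an ordinary least-squares fit of posttest on pretest within each class; the function names and the two four-pupil classes are hypothetical and were made up so that the arithmetic is easy to follow.

```python
def fit_line(pretest, posttest):
    """Least-squares coefficients (a, b) of posttest = a + b * pretest."""
    n = len(pretest)
    mean_x = sum(pretest) / n
    mean_y = sum(posttest) / n
    sxx = sum((x - mean_x) ** 2 for x in pretest)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(pretest, posttest))
    b = sxy / sxx            # slope; requires pretest scores that vary
    a = mean_y - b * mean_x  # intercept
    return a, b


def expected_gain_score(a, b, x):
    """EGS: predicted posttest score for a pupil with pretest score x."""
    return a + b * x


# Hypothetical classes: (pretest scores, posttest scores).
class_one = ([30, 40, 50, 60], [42, 50, 58, 66])
class_two = ([30, 40, 50, 60], [35, 43, 51, 59])

x = 45  # the same pretest score is entered into both equations
for name, (pre, post) in (("one", class_one), ("two", class_two)):
    a, b = fit_line(pre, post)
    print(name, expected_gain_score(a, b, x))
```

Because the same value of x is entered into each class's own equation, any difference between the two results reflects the classes' regression coefficients rather than differences in the pupils' starting points.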

The actual posttest score that any individual pupil with a 
given pretest score gets at the end of the school year may or may 
not equal the predicted posttest score, or EGS; pupils with the 
same pretest score will differ in other ways that affect the 
amount they learn. But the average posttest score of a large 
number of pupils with that pretest score would equal the 
predicted value, the EGS. In other words, the EGS is an estimate 
of the mean posttest score in a population of pupils with the 
same pretest score. 

While any arbitrarily chosen pretest score may be used, the 
average pretest score in some specific group is of greatest 
interest in most cases. Suppose, for example, that the mean 
score of all fifth-grade pupils in a school district on the 
pretest is substituted in the regression equations obtained in 
Miss Jones' fifth-grade class and in Miss Smith's fifth-grade 
class. Suppose that the EGS obtained in Miss Jones' class is 54 
and that obtained in Miss Smith's class is 47. This indicates 
that the average pupil in that school system would gain 7 points 
more in Miss Jones' class than in Miss Smith's. Within the limits 
of the errors of measurement, we are justified in concluding that 
Miss Jones is more effective with the average pupil than Miss 
Smith. 

In general, the teacher in whose class a pupil with a 
particular ability level (as measured on the pretest) would get 
the highest posttest score will be regarded as the teacher who is 
most effective with pupils at that ability level. Such EGS's are 
comparable for teachers in the same grade because the pretest 
scores are identical for all teachers. They are not usually 
comparable for teachers of different grades, however, because the 
average pretest score will differ for different grades. 

Pupil Ability and Teacher Effectiveness. Some of the research 
suggests that which pattern of classroom behavior is most 
effective in promoting pupil gains in achievement depends on the 
ability of the pupil. [2] If this is so, then one cannot assume 
that a teacher who is most effective with one type of pupil, such 
as the average pupil in a school district, is necessarily the 
most effective with all kinds of pupils. 

This possibility has many important and disturbing 
implications. One is that it does not make much sense to ask a 
principal to judge the effectiveness of a teacher without 
specifying the kind of pupil to be affected. It might be that 
one principal bases his judgments on how effective a teacher is 
with low-ability pupils while the researcher was measuring how 
effective each teacher is with pupils of average ability. 

For this reason, we estimated not one but two EGS's for each 
teacher, one for pupils whose pretest score is one standard 
deviation below the mean of the distribution of all pupils in the 
grade and school, and one for pupils whose pretest score is one 
standard deviation above the mean of the same distribution. The 
first pretest score was at the 16th percentile and the second at 
the 84th percentile of the distribution, so the first group of 
pupils will be referred to as "low-ability" pupils and the second 
as "high-ability" pupils. Because the regression is linear, the 
mean of these two EGS's is the EGS of pupils of average ability. 
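
The last statement follows directly from the form of the regression equation. As a short check, write the mean pretest score as \bar{x} and its standard deviation as s (symbols introduced here only for illustration; they are not used elsewhere in this report):

```latex
\frac{\bigl[a + b(\bar{x} - s)\bigr] + \bigl[a + b(\bar{x} + s)\bigr]}{2}
  = \frac{2a + 2b\bar{x}}{2}
  = a + b\bar{x}
```

The terms in s cancel, so the average of the low-ability and high-ability EGS's is exactly the EGS obtained by entering the mean pretest score into the same equation.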

To sum up, then, we had three measures of the effectiveness of 
each teacher: one with low-ability pupils, one with high-ability 
pupils, and one with pupils of average ability. A correlation 
between a principal's judgments and any one of these will be 
interpreted as evidence that his judgments are valid. 

As a measure of teacher effectiveness, an EGS is subject 
to measurement error. In order to obtain an estimate of this 
error, we split each teacher's class into random halves and 
calculated not one but two regression equations per class, one 
from each half. Substituting the same pretest score in each 
equation gave us two independent estimates of the same EGS. The 
mean of the two was used as the estimate of teacher effectiveness 
with pupils of the level of ability in question, and the 
difference between the two half-class values was an indicator of 
its accuracy. 
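
The split-half procedure can be sketched as follows. This is an illustration under stated assumptions rather than the program actually used: the function names, the fixed random seed, and the six-pupil class are hypothetical, and each half is assumed to contain at least two different pretest scores so that a line can be fitted.

```python
import random


def fit_line(pretest, posttest):
    """Least-squares coefficients (a, b) of posttest = a + b * pretest."""
    n = len(pretest)
    mean_x = sum(pretest) / n
    mean_y = sum(posttest) / n
    sxx = sum((x - mean_x) ** 2 for x in pretest)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(pretest, posttest))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def split_half_egs(pretest, posttest, x, seed=0):
    """Return (mean of the two half-class EGS's, their absolute difference)."""
    pairs = list(zip(pretest, posttest))
    random.Random(seed).shuffle(pairs)  # random assignment to halves
    half = len(pairs) // 2
    estimates = []
    for part in (pairs[:half], pairs[half:]):
        xs = [p[0] for p in part]
        ys = [p[1] for p in part]
        a, b = fit_line(xs, ys)
        estimates.append(a + b * x)     # same pretest score in each equation
    first, second = estimates
    return (first + second) / 2, abs(first - second)


# Hypothetical class of six pupils.
pre = [20, 30, 40, 50, 60, 70]
post = [33, 44, 49, 60, 65, 75]
print(split_half_egs(pre, post, x=45))
```

A small difference between the two half-class values suggests that the EGS is estimated with little error; a large one is a warning that the class is too small or too variable to pin the value down.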

Thus there were four expected gain scores for each random half of 
a class, two for high-ability pupils (one in reading and one in 
arithmetic) and two for low-ability pupils, making eight in all 
for each class. Each of these eight expected gain scores was 
correlated with principals' judgments of the effectiveness in 
performing each of the three roles of the teachers in each grade 
in each school, yielding 24 correlations per grade group. If a 
principal recorded judgments on teachers in 6 grades in his or 
her school, then we calculated 24 x 6 coefficients for that 
principal. 
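
The count of 24 can be verified by listing the combinations. The short Python sketch below does only that; the labels for the two half-classes are hypothetical.

```python
from itertools import product

ROLES = ("I", "II", "III")
SUBJECTS = ("reading", "arithmetic")
ABILITIES = ("low", "high")
HALVES = ("half A", "half B")  # hypothetical labels for the random halves

# One correlation is computed for every combination in a grade group.
cells = list(product(ROLES, SUBJECTS, ABILITIES, HALVES))


def coefficients_for_principal(n_grades):
    """Number of correlations for a principal who judged n_grades grades."""
    return len(cells) * n_grades


print(len(cells))                     # correlations per grade group
print(coefficients_for_principal(6))  # principal with six grade groups
```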
Data Analysis 

The first step in the analysis of the principals' judgments of 
teacher effectiveness in the three roles was to calculate a set 
of validity coefficients (correlations between principals' 
judgments and expected pupil gains) for each principal. It did 
not seem reasonable to us to compare judgments of teachers of 
different grades, so separate validity coefficients were 
calculated for the teachers of each grade who were judged by the 
same principal. Correlations were calculated between judgments 
on each of the three roles and EGS's of pupils of high and low 
ability, in reading and arithmetic, in random half-classes, 
making a total of 24 correlations for each grade judged by each 
principal, as well as mean correlations for grades, subjects, 
etc. 
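
A single validity coefficient of this kind can be sketched as an ordinary product-moment correlation over the teachers in one grade group. The sketch assumes the Pearson coefficient was the one used; the function name and the four-teacher group are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Product-moment correlation; None if either variable does not vary."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # e.g. the principal gave every teacher the same rank
    return sxy / sqrt(sxx * syy)


# Hypothetical grade group of four teachers.
role_one_judgments = [18, 15, 20, 12]     # principal's ranks (1 to 20)
reading_egs_low = [54.0, 47.0, 56.0, 45.0]  # EGS's for low-ability pupils
print(pearson_r(role_one_judgments, reading_egs_low))
```

With only three or four teachers in a group, a coefficient computed this way can take almost any value between -1 and +1 by chance, which is why the analysis relies on means over many such coefficients.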

Because the number of classes per grade in a school tended to 
be small, as it is in most schools, most of these correlations 
were based on rather small groups of teachers. The average 
number of teachers in one grade group was, in fact, only about 
3.7. The number of correlations estimated for each principal, on 
the other hand, tended to be quite large, so that the mean 
correlation between a principal's judgments and expected gain 
scores in which we were interested was stable enough for our 
purpose. 


The two main questions the study attempted to answer, how valid 
principals' judgments are on the average and whether some 
principals' judgments are more valid than others', will be 
answered by examining the distributions of principals' mean 
correlations and by analysis of variance. If there are 
significant differences in the validities of judgments made by 
different principals, we will ask what lies behind those 
differences. 

The third question, which has to do with factors related to the 
size of the validity coefficients, was answered by a series of 
analyses of variance, one per principal. The set of correlations 
calculated for each principal was submitted to an analysis of 
variance in which the correlation between the principal's 
judgments and teachers' EGS's was the dependent variable. Pupil 
Ability (high or low), Subject (reading or mathematics), Role (I, 
II, or III), and, for those principals who recorded judgments of 
teachers in two or more grades, Grade, were the independent 
variables. The design was a four-way factorial, [3] with the 
difference between correlations based on different halves of the 
same class providing the estimate of error. 

It should be noted that this error estimate reflected 
variations due to sampling of pupils from the population 
represented by the pupils in the same class only; it did not 
reflect variations due to sampling of teachers. The results 
obtained are therefore not, strictly speaking, generalizable to 
other teachers but only to other pupils with these same 
teachers. 

The purpose of the analysis was, of course, to examine the 
relationship between the dependent variable, the validity of the 
principal's judgment, and the independent variables as well as 
interactions between them. Since results for any one principal 
are of little interest, after estimating the components of the 
variance in each principal's correlation coefficients, we 
averaged the components across the sample to estimate the average 
importance of each factor in determining the magnitude of a 
correlation between any principal's judgments and EGS's of the 
same teachers. 



Analysis of TPAI Data 

Because of the small number of teachers rated on the TPAI in 
any one grade and school, it would not be possible to control 
grade, subject, and pupil ability by "blocking" them in the way 
we could in our study of the overall judgments. If and when the 
data become available, we will have to settle for a simple 
correlational analysis of the sample we obtain, one in which 
teachers and principals from different schools that use the same 
test are mixed together. Since in the certification process the 
instrument is used to compare teachers from different schools, 
this may not be an inappropriate way to assess its validity and 
its relationship to principals' judgments. 
NOTES 

[1] Medley, D.M. and Mitzel, H.E. Some behavioral correlates of 
teacher effectiveness. Journal of Educational Psychology, 
1959, 50, 239-246. 

[2] Medley, D.M. Teacher Competence and Teacher Effectiveness: 
A Review of Process-Product Research. Washington, D.C., 
1977; Lara, A. V. Pupil Ability as a Moderator of 

[3] For principals who rated only one grade, a three-way design 
was used. 
RESULTS AND DISCUSSION 



Distributions of Principals' Judgments 



Before we examine the correlations between principals' 
judgments and the expected gains of their pupils, let us examine 
the kinds of judgments the principals record. Table 1 shows the 
distributions of the judgments of our sample of teachers recorded 
by the 46 principals on the three roles. 

In these days when the public is convinced that there are so 
many incompetent teachers in the schools, these findings might 
make us wonder where they are. Fewer than 13% of these teachers 
were judged to be performing below average on any of the three 
roles. Indeed, according to their principals, these teachers 
were a remarkable group. About half of them were judged to be 
more effective than 83% of other teachers, and 13% were judged 
superior to all other teachers! This would be heartening news if 
we could believe it; but we cannot. Like most people, when 
asked to rate or judge someone else, these principals are 
extremely lenient. Realizing how very difficult it is to make 
such judgments as these, and knowing the impact a low rating can 
have on a teacher's career, they hesitate to rate any but the 
most glaringly incompetent teachers very low. 



Regardless of the validity or lack of validity of principals' 
judgments, the tendency that these figures clearly show for 
principals to overrate their teachers sharply limits the 
usefulness of their ratings as a basis for realistic decisions 
about teacher personnel. It also attenuates correlations between 
the judgments and other measures, including measures of teacher 
effectiveness. 



Note that the distributions for Roles II and III are 
identical. This does not mean, of course, that principals 
recorded identical judgments for each teacher; but it does mean 
that the amount of leniency displayed was the same on both 
roles. 



In this study our interest centers primarily on judgments of 
teacher effectiveness in the first role, since it is the one 
which should relate most closely to EGS's (expected gain 
scores). Figure 3 shows the distribution of Role I judgments in 
graphic form. Note the crude modes at 20, 18, and 15. They 
suggest that the 20 levels of effectiveness used represented 
finer gradations than the principals felt comfortable in 
judging. Both of these 
TABLE 1 

DISTRIBUTION OF PRINCIPALS' JUDGMENTS OF 
EFFECTIVENESS OF TEACHERS IN PERFORMING THREE ROLES 

              Estimated Percent of Teachers 

Rank       Role I       Role II      Role III 
          Academic     Affective   Professional 

 20         13.7         13.3         13.3 
 19          9.5          8.4          8.4 
 18         16.7         18.3         18.3 
 17         11.0          9.1          9.1 
 16          8.0          6.1          6.1 
 15         12.9         12.9         12.9 
 14          3.8          4.2          4.2 
 13          0.4          2.3          2.3 
 12          8.7          6.5          6.5 
 11          2.7          1.1          1.1 
 10          7.2         10.3         10.3 
  9          1.9          1.9          1.9 
  8          1.5          2.3          2.3 
  7          0.4          1.1          1.1 
  6          0.0          0.8          0.8 
  5          0.4          0.8          0.8 
  4          0.0          0.0          0.0 
  3          0.4          0.4          0.4 
  2          0.4          0.0          0.0 
  1          0.4          0.4          0.4 

Mean        15.6         15.4         15.4 
S.D.         3.7          3.9          3.9 
DISTRIBUTION OF PRINCIPALS' JUDGMENTS 
OF TEACHER EFFECTIVENESS IN ROLE I 

FIGURE 3 
tendencies, the tendency to overrate teachers and the tendency 
not to use all available levels, tend to reduce the correlations 
between principals' judgments and EGS's. 



A Sample Analysis 

Before discussing how valid principals' ratings are it will be 
useful to present an example of the kind and amount of data 
generated for each principal. The complete set of validity 
coefficients calculated for one principal, Principal No. 70, is 
shown in Table 2. Principal No. 70 recorded judgments of four 
groups of teachers representing four different grades. The total 
number of coefficients calculated would therefore be 96. [1] Table 
2 shows only the 48 whole-class values. 

Note that the mean of all 48 correlations is .32, which means 
that the average validity of this principal's judgments is 
estimated to be .32. Since there is no reason to expect judgments 
on Roles II or III to correlate with EGS's, the mean Role I 
correlation, which is .40, is a better indicator of the validity 
of this principal's judgments than the overall mean of .32. Note 
also that the validities of Role I judgments made by this 
principal seem to be higher in grades 2 and 4, where they equal, 
respectively, .47 and .46, than they are in grades 3 and 6, where 
they are only .25 and .27. 

Correlations based on samples as small as these, which contain 
only three or four teachers, are very unstable. But these are 
the sizes of the groups of teachers principals are called upon to 
compare; this is the evaluation task principals actually 
perform. A principal is likelier to need to decide which of 
three or four third grade teachers is the most competent than 
whether a third grade teacher is more competent than a sixth 
grade teacher. 

There is considerable variation among this principal's 
correlations with EGS's in different subjects, grades, and levels 
of pupil ability. This variation was examined by means of an 
analysis of variance in a four-way factorial design as shown in 
Table 3. [2] 

Notice that the only one of the factors studied that makes a 
statistically significant contribution to the validity of this 
principal's judgments is the interaction between grade taught and 
ability of pupil. From Table 2 we note that the differences 
between the validity coefficients for predicting gains of 
low-ability pupils and high-ability pupils for grades 2, 3, 4, 
and 6, respectively, were -.35, -.48, -.16, and +1.32. 
TABLE 2 

CORRELATIONS BETWEEN RATINGS OF TEACHERS AND PUPIL GAINS 
ACCORDING TO SUBJECT, GRADE, PUPIL ABILITY, AND TEACHER ROLE 
FOR PRINCIPAL NUMBER 70 

                         Ability         Teacher Role         Average 
Subject                  of Pupils      I       II      III   over Roles 

GRADE 2 

Reading                  Low           0.20    0.24    0.20     -- 
                         High          0.64    0.85    0.64    0.71 
                         Average       0.42    0.54    0.42    0.46 

Arithmetic               Low           0.19    0.20    0.19    0.20 
                         High          0.85    0.72    0.85    0.81 
                         Average       0.52    0.46    0.52    0.50 

Averages over Subjects   Low           0.20    0.22    0.20    0.20 
                         High          0.75    0.78    0.75    0.76 

Averages for Grade                     0.47    0.48    0.47    0.47 

GRADE 3 

Reading                  Low          -0.13   -0.13   -0.07   -0.11 
                         High          0.66    0.66    0.54    0.62 
                         Average       0.26    0.26    0.23    0.25 

Arithmetic               Low           0.15    0.15    0.15    0.15 
                         High          0.88    0.88    0.89    0.88 
                         Average       0.52    0.52    0.52    0.52 

Averages over Subjects   Low           0.01    0.24    0.21    0.15 
                         High          0.49    0.49    0.72    0.56 

Averages for Grade                     0.25    0.45    0.46    0.39 

GRADE 4 

Reading                  Low           0.83    0.73   -0.54    0.34 
                         High           --      --      --     0.04 
                         Average       0.49     --      --     0.19 

Arithmetic               Low          -0.07    0.35    0.28    0.19 
                         High          0.94    0.19   -0.97    0.05 
                         Average       0.43    0.27   -0.34    0.12 

Averages over Subjects   Low           0.38    0.54   -0.13    0.26 
                         High          0.54    0.15   -0.54    0.05 

Averages for Grade                     0.46    0.34   -0.33    0.16 

GRADE 6 

Reading                  Low           0.90    0.90    0.90    0.90 
                         High         -0.01   -0.01   -0.01   -0.01 
                         Average       0.45    0.45    0.45    0.45 

Arithmetic               Low           0.96    0.96    0.96    0.96 
                         High         -0.77   -0.77   -0.77   -0.77 
                         Average       0.10    0.10    0.10    0.10 

Averages over Subjects   Low           0.93    0.93    0.93    0.93 
                         High         -0.39   -0.39   -0.39   -0.39 

Averages for Grade                     0.27    0.27    0.27    0.27 

AVERAGES OVER GRADES 

Reading                  Low           0.43    0.44    0.12    0.34 
                         High          0.36    0.40    0.26    0.34 
                         Average       0.40    0.42    0.19    0.34 

Arithmetic               Low           0.31    0.42    0.40    0.37 
                         High          0.47    0.25    0.00    0.24 
                         Average       0.39    0.34    0.20    0.31 

Averages over Subjects   Low           0.38    0.43    0.26    0.36 
                         High          0.42    0.33    0.13    0.29 

Overall Averages                       0.40    0.38    0.20    0.32 
TABLE 3 

ANALYSIS OF VARIANCE OF TEACHER RATINGS MADE BY 
PRINCIPAL NUMBER 70 

                                       Sum of       Mean 
Source of Variation            df     Squares      Square        F 

Role Rated                      2      0.7854       0.393       1.073 
Subject Tested                  1      0.0213       0.021       0.058 
Grade Taught                    3      1.4227       0.474       1.296 
Ability of Pupil                1      0.0988       0.099       0.270 
Interaction R X S               2      0.0326       0.016       0.045 
Interaction R X G               6      2.1678       0.361       0.988 
Interaction R X A               2      0.1202       0.060       0.164 
Interaction S X G               3      1.1716       0.391       1.068 
Interaction S X A               1      0.1077       0.108       0.295 
Interaction G X A               3     15.7792       5.260      14.377* 
Interaction R X S X G           6      0.0308       0.005       0.014 
Interaction R X S X A           2      0.6265       0.313       0.856 
Interaction R X G X A           6      0.3061       0.051       0.139 
Interaction S X G X A           3      0.9654       0.322       0.880 
Interaction R X S X G X A       6      2.3265       0.388       1.060 
Residual Variation             45     16.4630       0.366 

Total Variation                92     42.4256 

*p < .05 
For some unknown reason, an apparent general tendency of this 
principal to prefer teachers who are more effective with bright 
pupils to teachers more effective with slow pupils seems to 
reverse itself rather dramatically in grade 6. 

Finally, Table 4 shows the estimated proportions of the 
variance in a single validity coefficient that are associated 
with each of the 16 factors isolated in the analysis of variance 
shown in Table 3. More than half of the variance in this 
principal's correlations may be attributed to the interaction 
between grade and ability, and more than one-third to unexplained 
influences (residual variation). None of the other factors makes 
any appreciable contribution. 



Factors in Validities of Principals' Judgments 

These results might be of some interest to Principal No. 70 as 
descriptive of his performance with these teachers; but they are 
of little interest to anyone else because they lack 
generalizability. To obtain more useful results we performed an 
analysis of variance like . this of the &XJ2 correlations 
calculated for each principal in the sample (see Appendix A). 

Proportions of variance associated with each of the 16 factors
were averaged across the 24 principals who recorded judgments of
teachers in two or more grades; the results are shown in Table 5.
Table 6 shows the proportions for the 8 components of variance
available in analyses of correlations for the 22 principals who
recorded judgments of teachers in one grade only.

For comparison, the data for principals who recorded judgments 
of teachers in two or more grades on these eight factors are also 
shown in Table 6.

It is clear from Table 5 that grade level is the major factor
related to the validity of principals' judgments of teachers. As
a main effect, it accounts for more than one sixth of the
variation; and the interaction Grade X Ability accounts for
another tenth. In all, factors involving Grade account for more
than 49% of the variations in validities of principals' judgments
of teacher effectiveness; and identifiable factors not involving
Grade for less than 18%. This should be compared with residual
(unexplained) variance, which accounts for 33%.

TABLE 4

FACTORS IN RATINGS MADE BY PRINCIPAL NO. 70

Factor                        Proportion of
                              Variance

Role Rated                       0.0043
Subject Tested
Grade Taught                     0.0103
Ability of Pupil                 0
Interaction R X S                0
Interaction R X G                0.0130
Interaction R X A                0
Interaction S X G                0.0117
Interaction S X A                0
Interaction G X A                0.5246
Interaction R X S X G            0
Interaction R X S X A            0.0054
Interaction R X G X A            0
Interaction S X G X A            0.0090
Interaction R X S X G X A        0.0686
Residual Variation               0.3529

TOTAL                            1.0000

TABLE 5

FACTORS RELATED TO THE MAGNITUDES OF CORRELATIONS BETWEEN
PRINCIPALS' RATINGS OF TEACHERS AND EXPECTED ACHIEVEMENT GAINS
OF PUPILS IN THE TEACHERS' CLASSES
(Based on 24 Principals Who Rated Teachers in Two or More Grades)

FACTOR                        PERCENT OF VARIANCE

Teacher Role Rated                  1.6
Subject Tested
Grade Taught                       17.7
Pupil Ability                       4.5
Interaction R X S                   0.1
Interaction R X G
Interaction R X A                   0.3
Interaction S X G                   7.8
Interaction S X A                   7.1
Interaction G X A                  10.3
Interaction R X S X G               0.5
Interaction R X S X A               0.0
Interaction R X G X A               0.4
Interaction S X G X A               8.2
Interaction R X S X G X A           0.9
Residual Variation                 33.0

TOTAL                             100.0



TABLE 6

FACTORS RELATED TO THE MAGNITUDES OF CORRELATIONS BETWEEN
PRINCIPALS' RATINGS OF TEACHERS AND EXPECTED ACHIEVEMENT GAINS
OF PUPILS IN THE TEACHERS' CLASSES

FACTOR                         PERCENT OF VARIANCE

                          Teachers in        Teachers in Two
                          One Grade Only     or More Grades
                          Rated              Rated

Role Rated                     6.4                3.3
Subject Tested                 6.9                8.0
Pupil Ability                  7.9                8.9
Interaction R X S              5.1                0.2
Interaction R X A              6.2                0.7
Interaction S X A             12.3               14.0
Interaction R X S X A          2.2                0.8
Residual Variation            53.0               64.1

TOTAL                        100.0              100.0

Note that in the analyses in which Grade was not a factor all
other factors combined account for less than half of the
variation. The relationship of the ability level of the pupil
whose expected gain score is correlated to the principal's
judgment of teacher effectiveness, for example, is small.
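
The "less than half" statement can be checked by adding up the Table 6 percentages for the identifiable factors. The following is only a minimal sketch, assuming the column values as they are read from Table 6; the dictionary and variable names are illustrative and not part of the study.

```python
# Percent of variance from Table 6 for each identifiable factor
# (residual variation excluded), stored as
# (teachers in one grade only, teachers in two or more grades).
table6 = {
    "Role Rated": (6.4, 3.3),
    "Subject Tested": (6.9, 8.0),
    "Pupil Ability": (7.9, 8.9),
    "Interaction R X S": (5.1, 0.2),
    "Interaction R X A": (6.2, 0.7),
    "Interaction S X A": (12.3, 14.0),
    "Interaction R X S X A": (2.2, 0.8),
}

# Sum each column; rounding removes floating-point noise.
one_grade_total = round(sum(v[0] for v in table6.values()), 1)
multi_grade_total = round(sum(v[1] for v in table6.values()), 1)

print("One grade only:", one_grade_total)
print("Two or more grades:", multi_grade_total)
```

Both totals fall below 50, and each is the complement of the residual percentage shown in the same column of Table 6.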

More disturbing is the fact that the teaching role rated has
virtually no relationship to the magnitude of the correlations,
which suggests that judgments on Roles II and III must correlate
with pupil gains just about as closely as Role I judgments. This
is verified in Table 7, which shows the mean correlations by role
and grade. The importance of grade level and the unimportance of
role are both clearly apparent here.

Mean correlations seem to be high in odd-numbered grades (3 and
5), and low in even-numbered grades (2, 4, and 6). We have no
ready explanation of this phenomenon; it may well be an artifact
of the sample of principals.



Distributions of Validity Coefficients

Figure 4 shows the distribution of Role I validity coefficients
(i.e., correlations between principals' judgments of teacher
effectiveness in Role I and expected gains of the average pupil
in a grade) across the sample of 87 grade groups rated. (The
picture is much the same for judgments on Roles II and III, which
are not shown.) The range is great, running (approximately) from
-.73 to +.83; and the distribution does not depart much from
normality.

The analysis of variance shown in Table 8 was designed to
indicate whether this wide range is evidence that some
principals' judgments are more valid than those of other
principals, as the figure suggests. The F-ratio for differences
between mean validities of different principals was only 1.19,
which does not justify rejection of the hypothesis that there are
no differences in the abilities of different principals to judge
how effective a teacher is. This conclusion is based on the fact
that judgments of teachers in different grades by the same
principal vary almost as much as judgments made by different
principals.

The F-ratio of 1.39 for interaction between principal and role
is also small, so the hypothesis that it makes no difference
which role is being rated cannot be rejected either.
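
The F-ratios quoted here can be recomputed from the sums of squares and degrees of freedom in Table 8. The sketch below is illustrative only and rests on two assumptions that should be read as such: (1) the principal degrees of freedom are taken as 45 and the interaction sum of squares as 4.5356, readings chosen because they make the table's degrees of freedom and sums of squares add to the printed totals; and (2) Role and Principal are tested against the mean square for groups within principals, while Group and Interaction are tested against the residual, which is the pairing the reported ratios imply. The helper names are hypothetical.

```python
# (degrees of freedom, sum of squares) for each source in Table 8.
# The principal d.f. (45) and interaction sum of squares (4.5356) are
# assumed readings of hard-to-read entries, not confirmed values.
table8 = {
    "role": (2, 0.3488),
    "principal": (45, 19.4288),
    "group": (41, 14.9144),
    "interaction": (90, 4.5356),
    "residual": (82, 2.9670),
}


def mean_square(source):
    """Mean square = sum of squares divided by degrees of freedom."""
    df, ss = table8[source]
    return ss / df


def f_ratio(effect, error):
    """F-ratio of an effect's mean square to a chosen error mean square."""
    return mean_square(effect) / mean_square(error)


# Assumed error terms: groups within principals for role and principal,
# residual variation for group and interaction.
f_role = f_ratio("role", "group")
f_principal = f_ratio("principal", "group")
f_group = f_ratio("group", "residual")
f_interaction = f_ratio("interaction", "residual")

for name, value in [("role", f_role), ("principal", f_principal),
                    ("group", f_group), ("interaction", f_interaction)]:
    print(f"{name}: F = {value:.2f}")
```

Under these assumptions the ratios agree to two decimals with the values reported in the text and in Table 8. Judging significance would additionally require critical values of the F distribution, which this sketch does not compute.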

TABLE 7

MEAN CORRELATIONS BETWEEN PRINCIPALS' RATINGS OF TEACHERS
ON THREE ROLES AND EXPECTED GAINS OF PUPILS
IN THE TEACHERS' CLASSES

GRADE       N     ROLE 1    ROLE 2    ROLE 3    AVERAGE

2          30      0.20      0.13      0.02      0.12
3          10      0.26      0.24      0.22      0.24
4          12      0.16      0.10      0.05      0.10
5                  0.23      0.22      0.25      0.23
6                  0.17      0.24      0.13      0.18

OVERALL            0.20      0.19      0.13      0.17






Correlation          Frequency

 0.91 to  1.00
 0.81 to  0.90       *****
 0.71 to  0.80       ****
 0.61 to  0.70       ***
 0.51 to  0.60       ******
 0.41 to  0.50       **********
 0.31 to  0.40       ************
 0.21 to  0.30       *********
 0.11 to  0.20       ******
 0.01 to  0.10       ******
-0.09 to  0.00       ******
-0.19 to -0.10       *****
-0.29 to -0.20       ***
-0.39 to -0.30       *****
-0.49 to -0.40       *
-0.59 to -0.50       **
-0.69 to -0.60       *
-0.79 to -0.70

TOTAL                87

DISTRIBUTION OF CORRELATIONS BETWEEN PRINCIPALS'
ROLE I RATINGS OF TEACHERS AND EXPECTED ACHIEVEMENT
GAINS OF STUDENTS IN THE TEACHERS' CLASSES

FIGURE 4

TABLE 8

ANALYSIS OF VARIANCE OF CORRELATIONS BETWEEN PRINCIPALS' RATINGS
OF 87 GROUPS OF TEACHERS ON THREE ROLES AND EXPECTED ACHIEVEMENT
GAINS OF STUDENTS IN THE TEACHERS' CLASSES

SOURCE OF VARIATION       D.F.    SUM OF     MEAN     F-RATIO
                                  SQUARES    SQUARE

Role Rated                  2      0.3488    0.174      0.48
Principal                  45     19.4288    0.432      1.19
Group (same principal)     41     14.9144    0.364     10.05*
Interaction (R X P)        90      4.5356    0.050      1.39
Residual variation         82      2.9670    0.036

TOTAL VARIATION           260     42.1946

What is significant (F = 10.05) is the difference between
validities of judgments of groups of teachers of different grades
made by the same principal. Since each such group is made up of
different individual teachers, the safest conclusion to draw is
that it is easier to judge differences in effectiveness of some
teachers than of others.



Concluding Observations 

Despite our best efforts we have not been able to develop any 
credible evidence to indicate that principals' judgments of 
teacher effectiveness have any validity as predictors of how much 
pupils may be expected to learn about reading or arithmetic from 
them. The mean correlation between a principal's judgment of a 
teacher's effectiveness in teaching subject matter and expected 
achievement gains of the average pupil in that teacher's class in 
this study was only .20. A correlation of this size indicates 
that only four percent of the variance in principals' judgments 
reflects differences in teacher effectiveness; 96% of what these
judgments indicate has nothing to do with teacher effectiveness. 
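
The step from a correlation of .20 to "four percent of the variance" is the usual squaring of the correlation coefficient. A minimal sketch follows, assuming the overall mean correlations for the three roles as read from Table 7; the function and variable names are illustrative.

```python
def variance_explained(r):
    """Proportion of variance shared by two variables correlated at r."""
    return r ** 2


# Overall mean correlations by role, as read from Table 7.
overall_means = {"Role I": 0.20, "Role II": 0.19, "Role III": 0.13}

shared = variance_explained(overall_means["Role I"])
unrelated = 1.0 - shared

for role, r in overall_means.items():
    print(f"{role}: r = {r:.2f}, shared variance = {variance_explained(r):.1%}")
```

For Role I the shared proportion is 0.04 and the remainder 0.96, which is the 4% and 96% split stated in the paragraph above.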

These data do little to encourage us to believe that how valid
a principal's judgment is depends on who the principal is,
either. But here the small numbers of teachers in each group may
be relevant. The range of estimated validities of different 
principals' judgments was very wide; but so was the range of 
estimates of validity of judgments made by the same principal. 

When we studied the variations in estimates of the validity of
judgments made by a single principal, we found that the major
source of such variation was differences between groups of
teachers of different grades. The parsimonious interpretation of
this is that it is harder to judge the effectiveness of some
teachers than others, and that this may be a function of grade
taught.

Teachers are being evaluated all over the place by methods that
are not detectably better than chance. If decisions about which
teacher to certify, which to hire, which to award tenure to, and
(soon) which deserve recognition as outstanding, were made by
a lottery they would be only a shade less accurate than the ones
being made on the basis of principals' judgments of teacher
effectiveness. It is time the profession accepted this
disagreeable fact and did something about it.

NOTES

If a principal rated groups of teachers in G grades there
would be 24G correlations, corresponding to G grades, 3
roles, 2 levels of pupil ability, 2 subjects, and 2
half-classes.

For readers interested in such matters, it should be noted
that in three instances, separate estimates of the same
correlation based on random halves of the same class were
not available; hence 3 degrees of freedom for estimating
residual variation were lost; the table therefore shows 45
degrees of freedom for residual variation (instead of 48)
and 92 degrees of freedom for total variation (instead of
95).
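
The counts in these notes can be reproduced with a short calculation. The sketch below is illustrative only. It assumes 3 roles, 2 ability levels, 2 subjects, and 2 half-classes per grade (the figure of 2 ability levels is inferred from the degrees of freedom in Table 3 rather than quoted from the authors) and G = 4 grades for Principal No. 70; the function name is hypothetical.

```python
# Design constants; "2 levels of pupil ability" is an inference from the
# degrees of freedom in Table 3, not a figure quoted from the authors.
ROLES = 3
ABILITY_LEVELS = 2
SUBJECTS = 2
HALF_CLASSES = 2


def anova_counts(grades, missing_estimates=0):
    """Return (correlations, total d.f., residual d.f.) for one principal."""
    cells = grades * ROLES * ABILITY_LEVELS * SUBJECTS
    correlations = cells * HALF_CLASSES - missing_estimates
    total_df = correlations - 1
    # Each cell with both half-class estimates contributes one residual d.f.
    residual_df = cells * (HALF_CLASSES - 1) - missing_estimates
    return correlations, total_df, residual_df


full_design = anova_counts(grades=4)
principal_70 = anova_counts(grades=4, missing_estimates=3)

print("Full design:", full_design)
print("With 3 missing estimates:", principal_70)
```

With four grades the full design yields 96 correlations (24 per grade), 95 total degrees of freedom, and 48 residual degrees of freedom; removing the three unavailable estimates gives the 92 and 45 shown in Table 3.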